|
Fault-tolerant computer systems are systems designed around the concepts of fault tolerance. In essence, they must be able to continue working to a level of satisfaction in the presence of faults. Fault tolerance is not just a property of individual machines; it may also characterise the rules by which they interact. For example, the Transmission Control Protocol (TCP) is designed to allow reliable two-way communication in a packet-switched network, even in the presence of communications links which are imperfect or overloaded. It does this by requiring the endpoints of the communication to ''expect'' packet loss, duplication, reordering and corruption, so that these conditions do not damage data integrity, and only reduce throughput by a proportional amount. Recovery from errors in fault-tolerant systems can be characterised as either roll-forward or roll-back. When the system detects that it has made an error, roll-forward recovery takes the system state at that time and corrects it, to be able to move forward. Roll-back recovery reverts the system state back to some earlier, correct version, for example using checkpointing, and moves forward from there. Roll-back recovery requires that the operations between the checkpoint and the detected erroneous state can be made idempotent. Some systems make use of both roll-forward and roll-back recovery for different errors or different parts of one error. == Types of fault tolerance == Most fault-tolerant computer systems are designed to handle several possible failures, including hardware-related faults such as hard disk failures, input or output device failures, or other temporary or permanent failures; software bugs and errors; interface errors between the hardware and software, including driver failures; operator errors, such as erroneous keystrokes, bad command sequences or installing unexpected software and physical damage or other flaws introduced to the system from an outside source.〔Fault-tolerant computer system design book contents. Dhiraj K. Pradhan, Pages: 135 – 138 1996 ISBN 0-13-057887-8〕 Hardware fault-tolerance is the most common application of these systems, designed to prevent failures due to hardware components. Most basically, this is provided by redundancy, particularly dual modular redundancy. Typically, components have multiple backups and are separated into smaller "segments" that act to contain a fault, and extra redundancy is built into all physical connectors, power supplies, fans, etc.〔Formal Techniques in Real-Time and Fault-Tolerant Systems: Second International Symposium, Nijmegen, the Netherlands, January 8–10, 1992, Proceedings By Jan Vytopil Contributor Jan Vytopil, Published by Springer, 1991, ISBN 3-540-55092-5, 978-3-540-55092-1 〕 There are special software and instrumentation packages designed to detect failures, such as fault masking, which is a way to ignore faults by seamlessly preparing a backup component to execute something as soon as the instruction is sent, using a sort of voting protocol where if the main and backups don't give the same results, the flawed output is ignored. Software fault-tolerance is based more around nullifying programming errors using real-time redundancy, or static "emergency" subprograms to fill in for programs that crash. There are many ways to conduct such fault-regulation, depending on the application and the available hardware.〔Fault-tolerant computer system design book contents. Dhiraj K. Pradhan, Pages: 221 – 235 1996 ISBN 0-13-057887-8〕 抄文引用元・出典: フリー百科事典『 ウィキペディア(Wikipedia)』 ■ウィキペディアで「Fault-tolerant computer system」の詳細全文を読む スポンサード リンク
|